Genome-wide association study (GWAS)

This notebook demonstrates conducting a genome-wide association study using the public 1000 Genomes dataset stored in BigQuery. Specifically we will:

  • Use the %%sql magic to write and execute SQL statements within the notebook
  • Extract data from BigQuery and create a local dataset that can be manipulated in Python for visualization and further analysis

NOTE:

  • If you're new to notebooks, or want to explore additional samples, check out the full list of general notebooks.
  • For additional Genomics samples, check out the full list of Genomics notebooks.

Experiment design

In this experiment, we'll be identifying variant positions within chromosome 12 that differ significantly between the case and control groups. The case group for the purposes of this notebook comprises individuals from the "EAS" (East Asian) super population; all other individuals form the control group. Variant data from the 1000 Genomes dataset is publicly accessible within BigQuery.


In [1]:
import gcp.bigquery as bq
variants_table = bq.Table('genomics-public-data:1000_genomes.variants')
variants_table.schema


Out[1]:

Classifying per-call variant positions into variant/non-variant groups

We can tally the reference/alternate allele counts for individual variant positions within chromosome 12. The field call.genotype is an integer in the range [-1, num_alternate_bases], interpreted as follows (a short Python sketch of this encoding appears after the list):

  • A value of negative one indicates that the genotype for the call is ambiguous (i.e., a no-call).
  • A value of zero indicates that the genotype for the call is the same as the reference (i.e., non-variant).
  • A value of one would indicate that the genotype for the call is the 1st value in the list of alternate bases (likewise for values >1).
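
To make this encoding concrete, here is a minimal Python sketch (the helper classify_genotype is purely illustrative, not part of any library):

def classify_genotype(genotype):
    # Interpret a single call.genotype value per the rules above.
    if genotype == -1:
        return 'no-call'
    elif genotype == 0:
        return 'reference'
    else:
        # 1-based index into the list of alternate bases.
        return 'alternate allele #%d' % genotype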

In [2]:
%%sql --module allele_counts

SELECT 
    reference_name,
    start,
    reference_bases,
    alternate_bases,
    end,
    vt,    
    call.call_set_name AS call_set_name,
    # 1000 genomes phase 1 data is bi-allelic so there is only ever a single alt
    SUM(0 = call.genotype) WITHIN RECORD AS ref_count,
    SUM(1 = call.genotype) WITHIN RECORD AS alt_count,
FROM
  FLATTEN((
    SELECT
      reference_name,
      start,
      reference_bases,
      alternate_bases,
      end,
      vt,
      call.call_set_name,
      call.genotype,
    FROM
      $variants_table
    WHERE
      reference_name = '12' -- i.e., chromosome 12
    ),
    call)

Let's verify that our allele counts match our expectations before moving on. Since these genomes are diploid, the reference and alternate counts for any given row should sum to exactly 2; a quick programmatic check follows the sampled rows below.


In [3]:
bq.Query(allele_counts, variants_table=variants_table).sample().to_dataframe()


Out[3]:
reference_name start reference_bases alternate_bases end vt call_set_name ref_count alt_count
0 12 19434623 A G 19434624 SNP HG00261 2 0
1 12 19434623 A G 19434624 SNP HG00593 2 0
2 12 19434623 A G 19434624 SNP NA12749 2 0
3 12 19434623 A G 19434624 SNP HG00150 2 0
4 12 19434623 A G 19434624 SNP NA19675 2 0
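
As a quick programmatic check (a sketch that re-runs the sampling query above), we can assert that every row carries exactly two alleles:

df = bq.Query(allele_counts, variants_table=variants_table).sample().to_dataframe()
# Each genome is diploid, so every call contributes exactly two alleles.
assert ((df['ref_count'] + df['alt_count']) == 2).all()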

Assigning case and control groups

Now we can join our allele counts with metadata available in the sample info table. We'll use this sample metadata to split the set of genomes into case and control groups based upon the super population group.


In [4]:
sample_info_table = bq.Table('genomics-public-data:1000_genomes.sample_info')

In [5]:
%%sql --module exp_groups

SELECT
  super_population,
  ('EAS' = super_population) AS is_case,
  call_set_name,
  reference_name,
  start,
  reference_bases,
  alternate_bases,
  end,
  vt,
  ref_count,
  alt_count,
FROM $allele_counts AS allele_counts
JOIN $sample_info_table AS samples
  ON allele_counts.call_set_name = samples.sample

In [6]:
bq.Query(exp_groups, allele_counts=allele_counts,
                     sample_info_table=sample_info_table,
                     variants_table=variants_table).sample().to_dataframe()


Out[6]:
super_population is_case call_set_name reference_name start reference_bases alternate_bases end vt ref_count alt_count
0 EUR False HG00261 12 18330083 C A 18330084 SNP 0 2
1 EAS True HG00593 12 18330083 C A 18330084 SNP 1 1
2 EUR False NA12749 12 18330083 C A 18330084 SNP 0 2
3 EUR False HG00150 12 18330083 C A 18330084 SNP 1 1
4 AMR False NA19675 12 18330083 C A 18330084 SNP 0 2

The variants table contains a few different types of variant: structural variants ("SV"), indels ("INDEL") and SNPs ("SNP").


In [7]:
%%sql
SELECT 
  vt,
  COUNT(*)
FROM $exp_groups
GROUP BY vt


Out[7]:

(rows: 3, time: 0.4s, cached, job: job_l3Z5dkMGEBkpl61jsNfJu6-GXmg)

For the purposes of this experiment, let's limit the variants to only SNPs.


In [8]:
%%sql --module snps
SELECT * 
FROM $exp_groups
WHERE vt='SNP'

In [9]:
bq.Query(snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).sample().to_dataframe()


Out[9]:
super_population is_case call_set_name reference_name start reference_bases alternate_bases end vt ref_count alt_count
0 EUR False HG00261 12 18372942 A G 18372943 SNP 2 0
1 EAS True HG00593 12 18372942 A G 18372943 SNP 2 0
2 EUR False NA12749 12 18372942 A G 18372943 SNP 2 0
3 EUR False HG00150 12 18372942 A G 18372943 SNP 2 0
4 AMR False NA19675 12 18372942 A G 18372943 SNP 2 0

Tallying reference/alternate allele counts for case/control groups

Now that we've assigned each call set to either the case or the control group, we can tally up the counts of reference and alternate alleles within each of our assigned case/control groups, for each variant position, like so:


In [10]:
%%sql --module grouped_counts

SELECT
    reference_name,
    start,
    end,
    reference_bases,
    alternate_bases,
    vt,
    SUM(ref_count + alt_count) AS allele_count,
    SUM(ref_count) AS ref_count,
    SUM(alt_count) AS alt_count,
    SUM(IF(TRUE = is_case, INTEGER(ref_count + alt_count), 0)) AS case_count,
    SUM(IF(FALSE = is_case, INTEGER(ref_count + alt_count), 0)) AS control_count,
    SUM(IF(TRUE = is_case, ref_count, 0)) AS case_ref_count,
    SUM(IF(TRUE = is_case, alt_count, 0)) AS case_alt_count,
    SUM(IF(FALSE = is_case, ref_count, 0)) AS control_ref_count,
    SUM(IF(FALSE = is_case, alt_count, 0)) AS control_alt_count,
FROM $snps
GROUP BY
    reference_name,
    start,
    end,
    reference_bases,
    alternate_bases,
    vt

Again, let's validate that the group-level counts make sense (still per variant position); a programmatic check follows the sampled rows below.


In [11]:
bq.Query(grouped_counts,
         snps=snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).sample().to_dataframe()


Out[11]:
reference_name start end reference_bases alternate_bases vt allele_count ref_count alt_count case_count control_count case_ref_count case_alt_count control_ref_count control_alt_count
0 12 113369176 113369177 T G SNP 2184 499 1685 572 1612 124 448 375 1237
1 12 84117090 84117091 T A SNP 2184 2182 2 572 1612 572 0 1610 2
2 12 41729418 41729419 G A SNP 2184 1101 1083 572 1612 283 289 818 794
3 12 41768110 41768111 G A SNP 2184 1121 1063 572 1612 284 288 837 775
4 12 26584345 26584346 G A SNP 2184 103 2081 572 1612 0 572 103 1509
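
As a sketch of that validation (re-running the sampling query above), the case and control totals should partition the overall allele count, as should the reference and alternate totals:

grouped_df = bq.Query(grouped_counts,
                      snps=snps,
                      exp_groups=exp_groups,
                      allele_counts=allele_counts,
                      sample_info_table=sample_info_table,
                      variants_table=variants_table).sample().to_dataframe()
# Case plus control alleles should account for every allele...
assert (grouped_df['case_count'] + grouped_df['control_count'] ==
        grouped_df['allele_count']).all()
# ...and so should reference plus alternate alleles.
assert (grouped_df['ref_count'] + grouped_df['alt_count'] ==
        grouped_df['allele_count']).all()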

Quantifying the statistical significance of each variant position

We can quantify the statistical significance of each variant position using the chi-squared test with Yates' continuity correction. Furthermore, we can restrict our result set to only statistically significant variant positions by ranking each position by its chi-squared score (decreasing) and thresholding the results at p <= 5e-8 (chi-squared score >= 29.71679 for one degree of freedom).
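
For reference, here is a minimal Python sketch of the same Yates-corrected statistic that the query below computes for each 2x2 (case/control by reference/alternate) contingency table; the helper yates_chi_squared is purely illustrative:

def yates_chi_squared(case_ref, case_alt, control_ref, control_alt):
    # Chi-squared with Yates' continuity correction:
    # sum over cells of (|observed - expected| - 0.5)^2 / expected.
    case = case_ref + case_alt
    control = control_ref + control_alt
    total = case + control
    ref = case_ref + control_ref
    alt = case_alt + control_alt
    cells = [(case_ref, float(ref) * case / total),
             (control_ref, float(ref) * control / total),
             (case_alt, float(alt) * case / total),
             (control_alt, float(alt) * control / total)]
    return sum((abs(observed - expected) - 0.5) ** 2 / expected
               for observed, expected in cells)

For example, yates_chi_squared(220, 352, 1593, 19) returns approximately 1086.505, matching the first row of the results further below.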


In [12]:
%%sql --module results

SELECT
  reference_name,
  start,
  end,
  reference_bases,
  alternate_bases,
  vt,
  case_count,
  control_count,
  allele_count,
  ref_count,
  alt_count,
  case_ref_count,
  case_alt_count,
  control_ref_count,
  control_alt_count,
  # http://homes.cs.washington.edu/~suinlee/genome560/lecture7.pdf
  # https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity
  ROUND(
    POW(ABS(case_ref_count - (ref_count/allele_count)*case_count) - 0.5,
      2)/((ref_count/allele_count)*case_count) +
    POW(ABS(control_ref_count - (ref_count/allele_count)*control_count) - 0.5,
      2)/((ref_count/allele_count)*control_count) +
    POW(ABS(case_alt_count - (alt_count/allele_count)*case_count) - 0.5,
      2)/((alt_count/allele_count)*case_count) +
    POW(ABS(control_alt_count - (alt_count/allele_count)*control_count) - 0.5,
      2)/((alt_count/allele_count)*control_count),
    3) AS chi_squared_score
FROM $grouped_counts
WHERE
  # For chi-squared, expected counts must be at least 5 for each group
  (ref_count/allele_count)*case_count >= 5.0
  AND (ref_count/allele_count)*control_count >= 5.0
  AND (alt_count/allele_count)*case_count >= 5.0
  AND (alt_count/allele_count)*control_count >= 5.0
HAVING
  # Chi-squared critical value for df=1, p-value=5*10^-8 is 29.71679
  chi_squared_score >= 29.71679
ORDER BY
  chi_squared_score DESC,
  allele_count DESC
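
The critical value 29.71679 used in the HAVING clause above can be derived with scipy (a quick sketch, assuming scipy is available in the notebook environment):

from scipy.stats import chi2
# Inverse survival function: chi-squared critical value for p = 5e-8 at df = 1.
print 'critical value = %.5f' % chi2.isf(5e-8, df=1)  # ~29.71679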

We now run this query over all the variants within chromosome 12 and return the first few most significant variants.


In [13]:
bq.Query(results,
         grouped_counts=grouped_counts,
         snps=snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).sample().to_dataframe()


Out[13]:
reference_name start end reference_bases alternate_bases vt case_count control_count allele_count ref_count alt_count case_ref_count case_alt_count control_ref_count control_alt_count chi_squared_score
0 12 110571373 110571374 A G SNP 572 1612 2184 1813 371 220 352 1593 19 1086.505
1 12 22509094 22509095 T C SNP 572 1612 2184 1589 595 116 456 1473 139 1073.082
2 12 22525352 22525353 A C SNP 572 1612 2184 1585 599 115 457 1470 142 1068.266
3 12 22524240 22524241 T C SNP 572 1612 2184 1584 600 115 457 1469 143 1065.294
4 12 33240567 33240568 C G SNP 572 1612 2184 1678 506 159 413 1519 93 1043.029

Scroll to the right in the above results to see that the positions deemed significant do in fact have significantly different case/control counts for the alternate/reference bases.

Computing Chi-squared statistics in BigQuery vs Python vs R

Let's compare these BigQuery-computed chi-squared scores to ones calculated via Python's statistical packages. Note that scipy's chi2_contingency applies Yates' continuity correction by default for 2x2 tables, matching the computation above.


In [14]:
import numpy as np
from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(np.array([ 
    [220, 352], # case 
    [1593, 19]  # control
]))

print 'Python Chi-sq score = %.3f' % chi2


Python Chi-sq score = 1086.505

For comparison, computing the same statistic separately in R gives the result displayed here:

> chisq.test(rbind(c(220, 352), c(1593, 19)))

    Pearson's Chi-squared test with Yates' continuity correction

data:  rbind(c(220, 352), c(1593, 19))
X-squared = 1086.505, df = 1, p-value < 2.2e-16

For both the Python and the R computations, the value matches the 1086.505 score from BigQuery.

Analyzing the GWAS results

First, how many statistically significant variant positions did we find?


In [15]:
%%sql 
SELECT COUNT(*) AS num_significant_snps
FROM $results


Out[15]:

(rows: 1, time: 0.5s, cached, job: job_LrwWjBy-rCHAGBsE2LpCikoKclA)

We now have a dataset that is sufficiently small to fit into memory on our instance, so let's pull the top 1000 SNP positions locally. Since we only need a subset of the columns, we can project our data first to remove unneeded columns.


In [16]:
%%sql --module sig_snps_dataset
SELECT * FROM (
  SELECT
    reference_name,
    start,
    reference_bases,
    alternate_bases,
    chi_squared_score
  FROM $results
  LIMIT 1000
)
ORDER BY start asc

In [17]:
sig_snps = bq.Query(sig_snps_dataset, 
                    results=results,
                    grouped_counts=grouped_counts,
                    snps=snps,
                    exp_groups=exp_groups,
                    allele_counts=allele_counts,
                    sample_info_table=sample_info_table,
                    variants_table=variants_table).to_dataframe()
sig_snps[:10]


Out[17]:
reference_name start reference_bases alternate_bases chi_squared_score
0 12 1595431 A G 744.125
1 12 1600977 A T 828.910
2 12 2129565 C T 711.515
3 12 2132494 T C 753.874
4 12 2134197 A G 822.961
5 12 2138687 C A 706.309
6 12 2139264 A C 769.534
7 12 2140624 T C 704.880
8 12 2141908 C T 790.150
9 12 6505528 T G 768.788

Let's visualize the distribution of significant SNPs along the length of the chromosome. The y-axis of the charts shows the chi-squared score: larger values indicate greater significance.


In [18]:
import matplotlib.pyplot as plt
import seaborn as sns  # importing seaborn restyles matplotlib charts

fig, ax = plt.subplots()
ax.scatter(sig_snps['start'], sig_snps['chi_squared_score'], alpha=0.3, c='red')
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')


Out[18]:
<matplotlib.text.Text at 0x7f956b1b8f50>

Let's zoom in on one region that contains a large number of very significant SNPs:


In [19]:
fig, ax = plt.subplots()
ax.scatter(sig_snps['start'], sig_snps['chi_squared_score'], alpha=0.5, c='red')
ax.set_xlim([3.3e7, 3.5e7])
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')


Out[19]:
<matplotlib.text.Text at 0x7f9540de9c50>

Further exploration of statistically significant variant positions

We can take our analysis further by mapping selected variant positions back to the chromosome and visualizing call sets and reads. Let's retrieve the top SNP identified when ranked by the Chi-squared score:


In [20]:
%%sql --module top_snp
SELECT start
FROM $sig_snps_dataset
ORDER BY chi_squared_score desc
LIMIT 1

In [21]:
bq.Query(top_snp,
         sig_snps_dataset=sig_snps_dataset,
         results=results,
         grouped_counts=grouped_counts,
         snps=snps,
         exp_groups=exp_groups,
         allele_counts=allele_counts,
         sample_info_table=sample_info_table,
         variants_table=variants_table).results()


Out[21]:

(rows: 1, time: 0.5s, cached, job: job_2uTBR9MmYxBWlv4lt3cSdw_dQmM)

Grab an arbitrary set of 10 callset IDs for rendering in the genome browser.


In [22]:
%%sql --module callset_ids
SELECT * FROM (
  SELECT call.call_set_id AS callset_id
  FROM $variants_table
  GROUP BY callset_id)
LIMIT 10

In [23]:
callsets_df = bq.Query(callset_ids, variants_table=variants_table).to_dataframe()
callsets = list(callsets_df['callset_id'])

In [24]:
from IPython.display import HTML
def gabrowse(dataset, reference_name, start_position, callset_ids):
    # Build one &callsetId= query parameter per call set to display.
    callsets_query_params = ''.join('&callsetId=%s&cBackend=GOOGLE' % callset_id
                                    for callset_id in callset_ids)
    # %3A is the URL-encoded ':' in the "reference:position" location string.
    url = ('https://gabrowse.appspot.com/#=&backend=GOOGLE&location='
           + reference_name + '%3A' + str(start_position)
           + callsets_query_params)
    return HTML('<iframe src="%s" width=1024 height=800></iframe>' % url)

Now we can render the call sets and reads for the selected SNP position by embedding the GABrowse application directly in our notebook.


In [25]:
gabrowse('1000genomes', '12', 110571373, callsets)


Out[25]:

Summary

This notebook illustrated how to conduct a GWAS experiment using variant data stored in the Google Genomics BigQuery tables, retrieve a local copy of the top results, and visualize the data with Python libraries.